Abstract

Wearable cameras offer a hands-free way to record egocentric images of daily
experiences, where social events are of special interest. The first step towards
detecting social events is to track the appearance of the multiple persons
involved in them. In this paper, we propose a novel method to find
correspondences of multiple faces in low temporal resolution egocentric videos
acquired through a wearable camera. This kind of photo-stream imposes additional
challenges on the multi-tracking problem with respect to conventional videos.
Due to the free motion of the camera and to its low temporal resolution, abrupt
changes in the field of view, in illumination conditions and in the target
location are highly frequent. To overcome such difficulties, we propose a
multi-face tracking method that generates a set of tracklets by finding
correspondences along the whole sequence for each detected face and takes
advantage of tracklet redundancy to deal with unreliable ones. Similar tracklets
are grouped into a so-called extended bag-of-tracklets (eBoT), which is intended
to correspond to a specific person. Finally, a prototype tracklet is extracted
for each eBoT, in which occlusions are estimated by relying on a new measure of
confidence. We validated our approach on an extensive dataset of egocentric
photo-streams and compared it to state-of-the-art methods, demonstrating its
effectiveness and robustness.

Keywords: Egocentric vision, face tracking, low frame rate video analysis. 


1. Introduction 

Wearable cameras and egocentric vision are recent trends, dating from barely the
last ten years, that have paved the way for challenging applications ranging
from healthcare to sport, security, tourism, and leisure. Wearable cameras
record, from a first-person point of view, where a person is, what the person
does and whom he/she interacts with. Thus, egocentric images are potentially
useful for understanding a person's lifestyle, or for the quantified self.
Egocentric images may also serve as digital memories, being particularly suited
to boosting the memory capabilities of people with memory impairment [?]. For
this particular application, low temporal resolution wearable cameras, such as
the Vicon Revue (3 fpm) and the Narrative Clip (2 fpm), are especially suited,
since they allow the recording of one's life over a long period of time.
However, extracting relevant information from egocentric videos with low
temporal resolution, hereafter called egocentric photo-streams, is not a trivial
task. Indeed, a massive number of images collected in unconstrained conditions
can be gathered even over a relatively limited period of time (up to 3000 images
per day using the Narrative Clip). Moreover, given the unpredictability of the
camera motion and the low temporal resolution of the camera, abrupt changes of
scene occur very often in the images.

During the last few years, several problems related to the analysis and
organization of egocentric videos have been addressed, from temporal
segmentation [?] and summarization [?] to event and action (self-action and
social interaction) recognition [?]. However, despite the importance of tracking
in the analysis of social interaction, this problem has received less attention
in egocentric vision than in conventional videos, where it has been an active
research area for a long time [12]. Tracking in egocentric videos, and in
egocentric photo-streams as a special case, differs from tracking in
conventional videos in several respects. Conventional tracking relies on the
assumption of temporal coherence, which does not hold for egocentric
photo-streams. Moreover, in egocentric photo-streams, the appearance of the
target as well as its position may change drastically from frame to frame. In
addition, due to changes in the camera field of view caused by body movement of
the camera wearer, background modeling becomes a more challenging issue (see
Fig. 1).

Figure 1: A sequence of images acquired by the Narrative Clip wearable camera.
The free motion of the camera and the abrupt variation in the appearance of the
target due to the low temporal resolution of the sensor can be appreciated.

When reviewing state-of-the-art trackers, two main categories of conventional
trackers can be found: offline and online trackers. The former category assumes
that object detection has already been performed in all frames, and trajectory
construction is achieved by linking different detections and tracks in offline
mode [13][14][15]. This property of offline trackers allows for global
optimization of the path and thus makes them potentially suitable for dealing
with photo-streams. Berclaz et al. [13] reformulate the linking step between
detections and trajectories as a constrained flow optimization problem, which
results in a convex problem that can be solved using the k-shortest paths
algorithm. In order to overcome the noisy candidate probabilities that may be
produced by the object detector, the authors make a set of assumptions,
including limited motion of the target. Zamir et al. [?] solve the data
association problem for one object at a time, while implicitly incorporating the
rest of the objects through global association based on Generalized Minimum
Clique Graphs (GMCP). GMCP incorporates both motion and appearance models over
the whole temporal span for optimization. In developing the aforementioned
trackers, the authors assume a rather fixed or predictable position for targets
in adjacent frames of the video. Although this assumption is generally
applicable in conventional videos, it does not hold in egocentric photo-streams.

In contrast to offline trackers, online trackers are given the target position
in the initial frame and need to establish the state of the target in the
following frames of the video. Among state-of-the-art online trackers, those
that are relatively tolerant to occlusion and drastic appearance changes are
more suitable for egocentric photo-streams [?]. Kalal et al. present a
Tracking-Learning-Detection (TLD) framework [?], which works by training a
discriminative classifier over labeled and unlabeled examples. This method
performs well in handling short-term occlusions, but strongly relies on optical
flow, which cannot be applied to low temporal resolution sequences. Compressive
Tracking (CT) [?] uses an appearance model based on features extracted in a
compressed sensing domain. This method is relatively robust to changes in
appearance and performs quite well on challenging datasets, outperforming TLD.
However, CT is not robust to large displacements of the target, which are very
frequent in egocentric sequences. In Locally Orderless Tracking (LOT) [?], the
target and the candidates in the new frame are first segmented into superpixels
and, among the set of candidates, the one with the smallest distance to the
target is selected as the target in the new frame. The LOT tracker adapts to
object appearance variations by matching with flexible rigidity, measuring the
distance between superpixels. Similar to LOT, the SuperPixel Tracker (SPT) [?]
extracts superpixels of the target. SPT extracts the color histograms of the
superpixels from the first 4 frames and, based on these features, clusters the
superpixels using mean-shift. A confidence value is assigned to each cluster,
from which the superpixel confidence of all pixels of the cluster is derived. In
the next frame, the candidate window with the highest confidence summed over all
superpixels in the window is selected as the new target. Mei et al. presented
L1O [?] as a tracker that explicitly detects occlusions. In L1O, the candidate
windows with a reconstruction error above a threshold are selected for
L1-minimization. When a certain amount of the pixels of the candidate window are
occluded, L1O detects an occlusion, which disables the model updating.

Conventional online trackers usually search for the target in the new frame
around its position in the previous frame. These trackers mostly depend on the
object appearance in the very first frames and generally require the feature
patches in neighboring frames to be close to each other. However, under the
specific conditions of egocentric photo-streams, such assumptions result in a
gradual drift of the estimated target away from the true target state, which
eventually leads to tracking loss.

The works most similar to ours are the trackers for Low Frame Rate (LFR) videos
[21][22]. Li et al. present a temporal probabilistic combination of
discriminative models with different learning and service periods, known as
their lifespans [21]. Each model is learned from different ranges of samples,
with different subsets of features, to achieve varying levels of discriminative
power. The different models are fused by a cascade particle filter to achieve
multiple stages of importance sampling. However, this work belongs to the class
of pre-trained trackers, whose performance also depends on the training data, an
issue that we try to avoid because of the peculiarity of our dataset, which
presents a relatively small number of images in each trackable segment. A recent
work on LFR tracking was presented by Zhou et al. [22]. The authors proposed a
Nearest Neighbor Field (NNF) driven stochastic sampling framework for abrupt
motion tracking. In this work, NNF provides candidate regions where the target
may exist. A Smoothing Stochastic Approximate Monte Carlo (SSAMC) sampling
scheme then predicts the state of the target more effectively. Finally, the
method refines the result with a sparse-representation-based template matching
technique.

Although the body of literature on tracking is huge, most existing approaches
cannot be directly applied to egocentric photo-streams, either because of the
unpredictability of motion or because of the drastic appearance changes that
characterize this data. Furthermore, most of the methods are not able to track
multiple targets simultaneously or require the manual specification of the
initial position of the target. To address these limitations, we previously
proposed the Bag-of-Tracklets (BoT) [23] for tracking in egocentric
photo-streams acquired by the SenseCam camera (3 fpm). The key idea underlying
our approach is that detection and tracking can be integrated to achieve strong
discriminative power. This approach belongs to the offline class of trackers,
which allows for general optimization of tracklets. The optimization consists of
generating a tracklet for each detected target and grouping similar tracklets
into bags that should correspond to different persons. This approach allows
unreliable bags-of-tracklets to be rejected, and eventually extracts a single
prototype for each reliable bag-of-tracklets.

In this manuscript, we present an extended Bag-of-Tracklets (eBoT) approach that
introduces several features to increase the robustness of BoT, even in
photo-streams acquired by cameras with lower frame rates (2 fpm) and narrower
fields of view:

• To cope with people appearing close to the camera, eBoT detects people
reliably by characterizing them by their face instead of their body.

• To handle target deformations and scale variations, eBoT employs a new
approach for finding correspondences based on an average deep matching score.

• eBoT presents a more robust way to compute the prototype of the
bag-of-tracklets.

• eBoT is tolerant of face occlusions and is able to explicitly localize them.

• eBoT introduces a confidence term to measure the reliability of the prototypes.

• eBoT is compared to six state-of-the-art methods using an enlarged set of
metrics from the CLEAR MOT [24] framework over an enlarged dataset.

The rest of the paper is organized as follows: in Sec. 2, we define the
confidence-based eBoT for multi-face tracking, covering seed and tracklet
generation, tracklet grouping, prototype extraction and occlusion treatment. In
Sec. 3, we introduce our experimental setup and discuss comparative results.
Finally, in the last section, we draw conclusions and sketch future work.

2. Confidence-based extended Bag-of-Tracklets for multi-face tracking 

People often engage in social events over the course of a day. A social event
typically happens in a specific environment with specific people. Thus, by
wearing a wearable camera, one captures those specific moments that are of
interest for later retrieval. However, the first step towards social event
retrieval from images is to find and track the appearance of people around the
camera wearer. Specifically, people who engage in a social event with the camera
wearer appear in a reasonable number of consecutive frames, while people
irrelevant to the camera wearer only appear occasionally in the photos and
normally do not stay in front of the camera wearer for a long time. Thus, by
incorporating additional information about the tracked people, their involvement
in the social interaction with the user can be established [25]. Our approach to
tracking multiple faces in egocentric photo-streams consists of four main steps:
seed and tracklet generation, grouping tracklets into bags-of-tracklets,
prototype extraction and occlusion treatment.

2.1. Seed and Tracklet Generation

Prior to any other computation, the first step of the proposed method is to
organize the long and unconstrained egocentric photo-streams into homogeneous
temporal segments. To this end, we apply R-clustering [?], an unsupervised
temporal segmentation method specifically formulated for egocentric
photo-streams. R-clustering consists of a Graph-Cut algorithm that finds a
trade-off between the under-segmentation produced by a concept drift detector
and the over-segmentation resulting from agglomerative clustering. The
clustering is performed over global image features extracted through
Convolutional Neural Networks to divide the photo-streams into structured
segments.

Among the segments created by the temporal segmentation step, those that contain
trackable persons are of particular interest for our purpose. To determine
whether a segment contains trackable persons, we evaluate the ratio between the
number of frames with detected faces and the total number of frames in the
segment. If the ratio is higher than a predefined threshold (0.5 in this work),
the segment is considered to contain trackable persons. As the output of this
phase, we collect a set of bounding boxes that surround the face of each person
throughout the sequence, which we call seeds. The generated seeds are shown by
red bounding boxes in Fig. 2.
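
The trackability test above can be sketched in a few lines. This is only an illustrative sketch: the function name and the input format (a list with the number of detected faces per frame) are assumptions made here, not part of the original implementation.

```python
def is_trackable_segment(faces_per_frame, min_ratio=0.5):
    """Decide whether a segment contains trackable persons.

    faces_per_frame: hypothetical input, one integer per frame giving the
    number of faces reported by the face detector in that frame.
    min_ratio: threshold on the share of frames with faces (0.5 in the text).
    """
    if not faces_per_frame:
        return False
    frames_with_faces = sum(1 for n in faces_per_frame if n > 0)
    ratio = frames_with_faces / len(faces_per_frame)
    # The text keeps a segment only if the ratio is strictly higher
    # than the threshold.
    return ratio > min_ratio
```

For instance, a segment of four frames with faces detected in three of them has a ratio of 0.75 and is kept, while a segment with faces in exactly half of its frames is not.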

Due to the nature of our photos, an in-the-wild face detector [?] that
substantially outperforms state-of-the-art face detectors [?] is applied to each
frame of the extracted segments to detect visible faces. The detector is based
on a mixture of trees with a shared pool of parts, where every facial landmark
is defined as a part and global mixtures are used to model topological changes
due to viewpoint. Different mixtures share part templates, which allows a large
number of views to be modeled with low complexity. Moreover, as shown by the
authors, tree-structured models are effective at capturing global elastic
deformation, while being easy to optimize using dynamic programming. Global
mixtures can also be used to capture large deformation changes for a single
viewpoint, such as changes in expression. Despite the relatively good
performance of the detector, it sometimes produces false positives or false
negatives due to the blur that occurs frequently in egocentric photos.

Figure 2: Detected faces (seeds) are shown by red bounding boxes. An example of
false negatives can be observed in frames 8 and 9. Only a sub-sample of the
original sequence is shown.

Hereafter, we refer to each segment containing trackable persons simply as a
sequence. For each seed, we generate a set of correspondences to the seed along
the sequence, called a tracklet, by propagating the seed forward and backward in
the sequence using a similarity measure detailed below. As a result, a tracklet
$T^i = \{t^i_b, \ldots, t^i_s, \ldots, t^i_e\}$ associated with the seed $i$
found at time $s$ begins at time $b$, where the backward tracking ends (first
frame in the sequence), and ends at time $e$, where the forward tracking ends
(last frame in the sequence). In the rest of the paper, we keep the convention
of using the variable $t$ to refer to the bounding box surrounding a face, the
upper index to identify the tracklet, and the lower index to identify the frame.
Note that, theoretically, the number of generated tracklets should be of the
order of the number of seeds found. For example, in the ideal case where the
face detector does not fail, two persons appearing in all 100 frames of the
sequence would generate 200 tracklets, each with a length of 100 frames.

To propagate a seed found in frame $s$ backward and forward, we search every
frame of the sequence for the region most similar to the seed. In order to deal
with abrupt displacements of the target, we generate the set of sample regions
with a sliding window. Since this approach generates a very high number of
samples for each image, we reduce the computational complexity by rejecting all
samples whose similarity to the seed in the HSV color space is lower than a
predefined threshold. The size of the sliding window depends on the size of the
seed under consideration. However, since the size of the face region can vary
considerably from frame to frame with its distance from the camera, we also
consider as samples all previously detected seeds in that frame.
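
A minimal sketch of this sample generation step follows. Several details are assumptions made for illustration only: boxes are `(x, y, w, h)` tuples, the window stride is a free parameter, the HSV color similarity is passed in as a hypothetical callable `color_similarity` (the text does not specify the measure), and the seeds detected in the frame are added without color filtering.

```python
def sliding_windows(image_size, window_size, stride):
    """Return all (x, y, w, h) windows of a fixed size that fit in the image."""
    image_w, image_h = image_size
    win_w, win_h = window_size
    return [
        (x, y, win_w, win_h)
        for y in range(0, image_h - win_h + 1, stride)
        for x in range(0, image_w - win_w + 1, stride)
    ]


def candidate_samples(image_size, seed_size, frame_seeds, color_similarity,
                      min_similarity, stride):
    """Build the candidate regions for one frame.

    seed_size: (w, h) of the seed, used as the sliding window size.
    frame_seeds: boxes already found by the face detector in this frame.
    color_similarity: hypothetical callable mapping a box to its HSV
        similarity to the seed.
    """
    windows = sliding_windows(image_size, seed_size, stride)
    # Reject windows whose color similarity to the seed is below the threshold.
    kept = [box for box in windows if color_similarity(box) >= min_similarity]
    # Detected seeds cover face sizes that differ from the seed size.
    return kept + list(frame_seeds)
```

The surviving candidates would then be ranked by their deep matching score to the seed, which is the expensive step that the color filter is meant to limit.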

The similarity between the seed and each sample in a frame of the sequence is
measured by its average deep matching score [28]. Deep matching is conceived as
a 2D warping that is able to deal with various kinds of object-induced or
camera-induced image deformations, including scaling and rotation. Instead of
using SIFT patches directly as descriptors, each SIFT patch is split into four
so-called quadrants and, assuming independent motion (to some extent) of each of
the four quadrants, the similarity is computed by optimizing the positions of
the four quadrants of the target descriptor.



Figure 3: An example of a tracklet generated based on deep matching. The red box
corresponds to the seed from which the tracklet is generated. The green box in
each frame corresponds to the sample with the highest deep matching score to the
seed.


For simplicity, let us consider two sequences of R-dimensional descriptors in a
1D warping case: the reference, which corresponds to the seed, say
$P_s = \{P_s(i)\}_{i=0}^{R-1}$, and the target, say
$P_t = \{P_t(i)\}_{i=0}^{R-1}$, which corresponds to a sample in a frame. The
optimal warping between them is defined by the function
$w^* : \{0, \ldots, R-1\} \rightarrow \{0, \ldots, R-1\}$ that maximizes the
average value of the similarities between their elements:

$$A(w^*) = \max_{w \in W} S(w) = \max_{w \in W} M_i \{ \mathrm{sim}(P_s(i), P_t(w(i))) \}_{i=0,\ldots,R-1} \qquad (1)$$

where $w(i)$ returns the position of element $i$ in $P_t$, $M_i$ is the average
value of the set of similarity values generated by varying $i$, and sim is the
non-negative cosine similarity between pixel gradients. The deep matching
algorithm is built upon a multi-stage architecture that interleaves convolutions
and max-pooling at three different scales among the feasible warpings between
descriptors. The set of feasible warpings $W$ is defined recursively, so that
finding the optimal warping $w^*$ can be done efficiently by a dynamic
programming strategy. Fig. 3 illustrates an example of a tracklet generated by
deep matching for one of the seeds in the sequence shown in Fig. 2. The seed is
depicted by a red bounding box, and the green bounding boxes correspond to the
samples with the highest deep matching score to the seed in every frame. As can
be seen, the tracklet corresponds to the same person who generated the seed.
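
To make equation (1) concrete, the following toy sketch scores a 1D warping as the average non-negative cosine similarity between matched elements and picks the best warping from an explicit list. It is not the deep matching algorithm: the real method defines the feasible set $W$ recursively and searches it by dynamic programming, whereas here the candidate warpings are simply enumerated by the caller, and all function names are illustrative.

```python
import math


def nonneg_cosine(u, v):
    """Cosine similarity between two vectors, clipped at zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return max(dot / (norm_u * norm_v), 0.0)


def warping_score(p_s, p_t, w):
    """S(w): average similarity between P_s(i) and P_t(w(i)) over all i."""
    return sum(nonneg_cosine(p_s[i], p_t[w[i]]) for i in range(len(p_s))) / len(p_s)


def best_warping(p_s, p_t, feasible_warpings):
    """Brute-force stand-in for the maximization over W in equation (1)."""
    return max(feasible_warpings, key=lambda w: warping_score(p_s, p_t, w))
```

With `p_s = [(1, 0), (0, 1)]` and `p_t = [(0, 2), (3, 0)]`, the identity warping `[0, 1]` matches orthogonal descriptors, while the swap `[1, 0]` matches parallel ones and is therefore selected.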


2.2. Grouping tracklets into Bag-of-tracklets 

We assume that tracklets generated by seeds belonging to the same person in the
sequence are very likely to be similar to each other; we aim to group them into
a set of eBoTs, where eBoTs are disjoint by definition. Let us consider an eBoT,
say $\mathcal{T}$, as a set containing a tracklet, $\mathcal{T} = \{T^i\}$,
where $T^i$ does not belong to any other eBoT. Also, let us consider another
tracklet $T^j$ that has not been assigned to any eBoT yet. Let $t^i_k$ and
$t^j_k$ be the bounding boxes where the person is detected (by the face detector
or by the tracker) at frame $k$ for tracklets $T^i$ and $T^j$, respectively.

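We define the similarity between two tracklets $T^i$ and $T^j$ as the average
over frames of the area of the intersection between $t^i_k$ and $t^j_k$ divided
by the area of their union:

$$S(T^i, T^j) = \frac{1}{|T|} \sum_{k=1}^{|T|} \frac{|t^i_k \cap t^j_k|}{|t^i_k \cup t^j_k|}$$

where $|T|$ is the length of the sequence in frames.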


Given a tracklet $T^j$, it is added to the eBoT $\mathcal{T}$ if the similarity
between $T^j$ and all tracklets in $\mathcal{T}$ is high enough. In this work,
we found experimentally that a threshold of 0.2 for including a tracklet in an
eBoT provided the best results. Before adding tracklets to an eBoT, we sort them
by their similarity to the first tracklet in the eBoT. Since subsequent
tracklets need to be compared to the tracklets already in an eBoT, sorting the
tracklets beforehand helps to avoid the aggregation of biased tracklets in the
eBoT.

The similarity of a tracklet $T^j$ to the eBoT $\mathcal{T}$ is defined as the
average of its similarities to all the tracklets in the eBoT:

$$S(T^j, \mathcal{T}) = \frac{1}{|\mathcal{T}|} \sum_{T^i \in \mathcal{T},\, T^i \neq T^j} S(T^j, T^i) \qquad (2)$$

where $|\mathcal{T}|$ is the number of tracklets in the eBoT. After grouping by
similarity, all tracklets in an eBoT are very likely to correspond to the same
person.
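
The grouping procedure described in this subsection can be sketched as a greedy loop. This is an illustrative reading under stated assumptions, not the authors' code: boxes are `(x, y, w, h)` tuples, all tracklets span the same frames, a new eBoT is started from the first unassigned tracklet, and a tracklet joins when its average similarity to the eBoT (equation (2)) exceeds the 0.2 threshold.

```python
def box_iou(a, b):
    """Intersection area over union area of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = min(ax + aw, bx + bw) - max(ax, bx)
    inter_h = min(ay + ah, by + bh) - max(ay, by)
    inter = max(inter_w, 0) * max(inter_h, 0)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def tracklet_similarity(t_i, t_j):
    """Average per-frame IoU between two tracklets of equal length."""
    return sum(box_iou(a, b) for a, b in zip(t_i, t_j)) / len(t_i)


def ebot_similarity(tracklet, ebot):
    """Equation (2): average similarity of a tracklet to an eBoT's tracklets."""
    return sum(tracklet_similarity(tracklet, other) for other in ebot) / len(ebot)


def group_tracklets(tracklets, threshold=0.2):
    """Greedily partition tracklets into disjoint eBoTs."""
    remaining = list(tracklets)
    ebots = []
    while remaining:
        first = remaining.pop(0)
        ebot = [first]
        # Sort candidates by similarity to the first tracklet of the eBoT.
        remaining.sort(key=lambda t: tracklet_similarity(first, t), reverse=True)
        leftover = []
        for tracklet in remaining:
            if ebot_similarity(tracklet, ebot) > threshold:
                ebot.append(tracklet)
            else:
                leftover.append(tracklet)
        remaining = leftover
        ebots.append(ebot)
    return ebots
```

Visiting the most similar tracklets first means that the eBoT average is built from the closest matches before borderline tracklets are tested against it, which is one way to read the remark on avoiding biased tracklets.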

However, not all tracklets in an eBoT are equally reliable. In addition, some
eBoTs may correspond to seeds that are false positive detections. While the
first issue is related to prototype extraction and will be addressed in the next
subsection, here we detail how to remove unreliable eBoTs that do not correspond
to any person in the video. To this end, we define the density of an eBoT as
$d(\mathcal{T}) = |\mathcal{T}| / |T|$, where $|\mathcal{T}|$ is the number of
its tracklets and $|T|$ is the length of the sequence.

Ideally, the density should be equal to 1, and the eBoT would contain as many
tracklets as the number of frames in which the person is present in the video.
In practice, since the face detection algorithm as well as the matching
algorithm may generate unreliable detections, the eBoT seeks a consensus among
the different tracklets to obtain the right tracking outcome. As expected,
reliable eBoTs behave differently from unreliable ones, the latter having very
low density. Based on this observation, we discard as unreliable all eBoTs with
a density lower than a predefined threshold. In this work, we found empirically
that a threshold of 0.2 gives good results. By excluding unreliable eBoTs, we
obtain as many eBoTs as there are persons in the sequence (see Fig. 4).

Figure 4: Example of a reliable eBoT, after excluding unreliable eBoTs,
extracted from the sequence in Fig. 2. Each row shows one tracklet of the eBoT,
which consists of 7 tracklets in total. The red box in each row indicates the
seed of that tracklet, and the green boxes correspond to the samples with the
highest average deep matching score to their corresponding seed. As can be
appreciated, all tracklets in the eBoT correspond to the same person.
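
The density test for discarding unreliable eBoTs amounts to a simple filter. The sketch below assumes an eBoT is represented as a list of tracklets; the function names are illustrative.

```python
def ebot_density(ebot, sequence_length):
    """Density of an eBoT: number of tracklets over the sequence length."""
    return len(ebot) / sequence_length


def reliable_ebots(ebots, sequence_length, min_density=0.2):
    """Keep only the eBoTs whose density is not lower than the threshold."""
    return [e for e in ebots if ebot_density(e, sequence_length) >= min_density]
```

For a sequence of 10 frames, an eBoT with 7 tracklets has density 0.7 and is kept, whereas an eBoT containing a single tracklet (density 0.1), as a false positive seed would typically produce, is discarded.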


2.3. Prototype extraction 

A prototype extracted from an eBoT $\mathcal{T}$ should represent all tracklets
in the eBoT. Thus, it should localize the face of a person in every frame. Since
the detection of the target in a given frame of the sequence varies depending on
the seed that generated the tracklet, for each frame we choose as the prototype
bounding box the one that has the biggest intersection with the bounding boxes
of the rest of the tracklets in that frame, namely:

$$T = \{t_b, \ldots, t_k, \ldots, t_e\}, \quad \text{so that} \quad t_k = \arg\max_{t^i_k,\; i=1,\ldots,|\mathcal{T}|} \sum_{j=1,\, j \neq i}^{|\mathcal{T}|} |t^i_k \cap t^j_k|$$

where $|\mathcal{T}|$ is the number of tracklets in the eBoT, and $t^i_k$,
$t^j_k$ are the bounding boxes of detected faces in the $k$-th frame of
tracklets $T^i$ and $T^j$ from the eBoT $\mathcal{T}$, respectively.
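
The per-frame selection can be sketched as follows. The overlap measure is an assumption: the sketch scores each box by the sum of its intersection areas with the boxes of the other tracklets in the same frame, which is one reading of "biggest intersection with the rest of the tracklets"; boxes are again assumed to be `(x, y, w, h)` tuples.

```python
def intersection_area(a, b):
    """Area of the overlap between two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = min(ax + aw, bx + bw) - max(ax, bx)
    inter_h = min(ay + ah, by + bh) - max(ay, by)
    return max(inter_w, 0) * max(inter_h, 0)


def extract_prototype(ebot):
    """Pick, for every frame, the box that overlaps most with the others.

    ebot: list of tracklets, each a list of boxes with one box per frame.
    """
    prototype = []
    for k in range(len(ebot[0])):
        boxes = [tracklet[k] for tracklet in ebot]

        def support(i):
            # Total overlap of box i with the boxes of all other tracklets.
            return sum(intersection_area(boxes[i], boxes[j])
                       for j in range(len(boxes)) if j != i)

        best = max(range(len(boxes)), key=support)
        prototype.append(boxes[best])
    return prototype
```

Because the choice is made independently in every frame, the prototype can borrow boxes from different tracklets along the sequence, so a single drifting tracklet does not determine the result.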



(a) 



(b) 

Figure 5: Two Prototypes extracted for the two persons in the sequence. 


Fig. 5 shows two prototypes, each extracted from a separate eBoT, only one of
which is shown in Fig. 4. Note how the prototype correctly tracks the person
even though the face detector misses the person in several frames. The missed
detections can be seen in Fig. 2.

2.4. Occlusion treatment 

Besides optimizing the localization of the target, a good prototype should also
indicate the presence of occlusions or unreliable detections. In order to
increase the accuracy of the method, we detect in the final prototype those
frames where the target is fully or partially occluded or where the detection is
unreliable. To this end, we define a function $A(t^i_s, t^i_k)$ that associates
with each bounding box $t^i_k$ of a tracklet $T^i$ the value of its deep
matching score to its seed $t^i_s$. We define the frame confidence as the
average of the normalized deep matching scores of the bounding boxes of all the
tracklets of the same eBoT in that frame, that is:

$$C_k = \frac{1}{|\mathcal{T}|} \sum_{i=1}^{|\mathcal{T}|} A(t^i_s, t^i_k) \qquad (3)$$

In equation (3), $C_k$ is the frame confidence, $|\mathcal{T}|$ is the number of
tracklets in the eBoT, $t^i_s$ is the seed of the $i$-th tracklet of the eBoT
and $t^i_k$ is the bounding box of frame $k$ of the $i$-th eBoT tracklet. The
deep matching scores between bounding boxes in the eBoT are normalized between
zero and one.

When there is a severe or partial occlusion of the face, or the target is
missing, the confidence $C_k$ of the eBoT on that frame drops. This phenomenon
can be observed in Fig. 7, where, due to the partial occlusion of faces in
frames 5 and 6 in Fig. 7(a) and in frame 6 in Fig. 7(b), the confidence value in
these frames reaches a minimum and lies below the predefined threshold for
occlusion estimation. In all the occlusion cases shown in Fig. 7(a) and (b), the
face of the person is only partially occluded. This fact shows the robustness of
the method in detecting large changes in face appearance.

The value of the threshold for estimating occlusions, say $L$, is calculated
over a subset of 15 sequences that constitute the training dataset. Fig. 6 shows
the normalized confidence value calculated using equation (3) for frames where
the target is occluded (left) and for frames where the target is not occluded
(right). For non-occluded frames, we used the groundtruth tracklet to compute
the confidence values, whereas for occluded frames we generated a fake tracklet
by randomly defining a bounding box where there is no face. As a tracklet is
generated for each seed, in Fig. 6 we plot on the left the median and the mean
of the deep matching score over all the generated fake tracklets, and on the
right the median and the mean of the deep matching score over all the
groundtruth tracklets of a sequence. The threshold $L$ (black line) is obtained
as the median of all the median confidence values over occluded frames. We
obtained $L = 0.12$.

Figure 6: Normalized confidence value for fake tracklets generated from an
occluded target (left) and for groundtruth tracklets (right). The threshold $L$
we use to estimate occlusions is depicted in black.

After estimating occlusions, we refine the frame confidence presented in
equation (3) by setting it to zero for occluded frames, that is:

$$C_k = \begin{cases} \frac{1}{|\mathcal{T}|} \sum_{i=1}^{|\mathcal{T}|} A(t^i_s, t^i_k), & \text{if } \frac{1}{|\mathcal{T}|} \sum_{i=1}^{|\mathcal{T}|} A(t^i_s, t^i_k) > L \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$
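
The frame confidence, its thresholded version and the estimation of the threshold can be sketched as below. The inputs are assumptions made for illustration: `scores` holds the normalized deep matching scores $A(t^i_s, t^i_k)$ of all tracklets of one eBoT in one frame, and the threshold is estimated from one list of occluded-frame confidences per training sequence, which is one reading of "the median of all the median confidence values".

```python
from statistics import median


def frame_confidence(scores):
    """Equation (3): mean normalized deep matching score in one frame."""
    return sum(scores) / len(scores)


def refined_frame_confidence(scores, threshold=0.12):
    """Equation (4): confidence is set to zero when the frame is occluded."""
    confidence = frame_confidence(scores)
    return confidence if confidence > threshold else 0.0


def occlusion_threshold(occluded_confidences_per_sequence):
    """Median over training sequences of the per-sequence median confidence."""
    return median(median(values) for values in occluded_confidences_per_sequence)
```

A frame whose tracklets score 0.05 and 0.10 against their seeds has a mean confidence of 0.075, which is below $L = 0.12$, so the frame is flagged as occluded and its refined confidence is zero.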
(a)



(b) 


Figure 7: Frame confidence of the two prototypes shown in Fig. 5, as defined in
Section 2.4. The occurrences of occlusion for every person in the sequence
according to the groundtruth are shown by red stars in the plot. The black line
corresponds to L, the threshold determined to estimate occlusions. As can be
seen, the occurrences of face occlusion indicated in the groundtruth largely
coincide with the drops in the calculated confidence of the face in those
frames.


2.5. Confidence of prototypes 

A prototype can be very useful as a basis for applications such as determining
the type of a social interaction and social roles. Thus, estimating the
confidence of an extracted prototype is a valuable task. We define the prototype
confidence as the mean confidence over all its frames that do not undergo
occlusion, weighted by a term that penalizes occlusions, that is:

$$C(T) = \frac{1}{|T| - z} \sum_{k} C_k \times \max\left(1 + \beta \log\frac{|T| - z}{|T|},\, 0\right) \qquad (5)$$

where the sum runs over the non-occluded frames $k$ of the prototype, $|T|$ is
the length of the prototype, $z$ is the number of frames where the face is
occluded or missing, and $\beta$ is a control parameter that depends on the
performance of the detector (we found that $\beta = 1$ gives reasonable
results). Note that, in the absence of occlusion ($z = 0$), the penalty term
equals 1 and the prototype confidence reduces to the mean frame confidence.
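
A small numerical sketch of the prototype confidence follows. It rests on assumptions that the text does not settle: occluded frames are recognized by a zero refined frame confidence, the mean is taken over the non-occluded frames only, and the logarithm is natural.

```python
import math


def prototype_confidence(frame_confidences, beta=1.0):
    """Mean confidence of visible frames times an occlusion penalty.

    frame_confidences: refined per-frame confidences of the prototype,
    where 0 marks an occluded or missing face (assumption of this sketch).
    """
    length = len(frame_confidences)
    visible = [c for c in frame_confidences if c > 0]
    z = length - len(visible)  # number of occluded or missing frames
    if not visible:
        return 0.0
    mean_visible = sum(visible) / len(visible)
    # The penalty shrinks as the share of visible frames drops, floored at 0.
    penalty = max(1.0 + beta * math.log((length - z) / length), 0.0)
    return mean_visible * penalty
```

With no occluded frames the penalty is 1 and the result is the plain mean. With half of the frames occluded and $\beta = 1$, the penalty is $1 + \ln 0.5 \approx 0.31$; once roughly 63% or more of the frames are occluded, the penalty reaches zero.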

Equation (5) is inspired by the definition of tracklet confidence given by Bae
and Yoon in Multi-Object Tracking based on Tracklet Confidence [29]. The first
term is related to the coherence in appearance of the target along the tracklet:
a more coherent appearance in a tracklet increases the confidence of the
tracklet. The second term is related to the continuity of the tracklet: it
decreases for occluded tracklets. Therefore, the final prototype should have a
larger confidence than all the tracklets in an eBoT. After estimating occlusions
for the prototypes, we associate a confidence value with each tracklet of the
eBoT by using equation (5) and verify that the confidence of the prototype is
higher than the highest tracklet confidence in the eBoT. In our evaluation, the
average confidence of all prototypes in our test set is 0.54, which is higher
than the average confidence of all the tracklets in all eBoTs, which is 0.32.

3. Experiments and discussion 

3.1. Dataset 

Currently, there is no dataset with groundtruth information for person tracking
in egocentric photo-streams. Hence, to measure the performance of the proposed
model, we created a dataset acquired with the Narrative Clip camera*. We
manually annotated the sequences that contain trackable people and localized the
position of their faces. The dataset was acquired by five users of different
ages. Each user wore the camera for a number of non-consecutive days over an
80-day period, collecting ~20,000 images in total. Our dataset contains a total
of 108 different trackable persons along 80 sequences with an average length of
25 frames**. Table 1 provides further details of the proposed dataset.

* http://getnarrative.com/

Table 1: Detailed breakdown of our dataset, made of ~20,000 images captured by 5
users.

User   Days   Total frames   Total frames with person(s)   Total frames with occlusion   Average daily duration
1      30     6478           680                           53                            8h
2      5      1228           125                           17                            8h
3      10     3428           220                           27                            8h
4      28     6894           850                           96                            8h
5      7      2178           425                           22                            6h


3.2. Experimental setup 

After partitioning a photo-stream captured by the Narrative Clip into segments, a
face detector is applied to exclude non-trackable segments and generate possible seeds
for trackable segments, called sequences. Then, a tracklet is generated for each seed in
a sequence. Finally, the tracklets are grouped into eBoTs and a final prototype with
estimated occlusions is extracted from each reliable eBoT. These prototypes constitute the
final output of our method. In the next section, quantitative and qualitative comparisons
between our approach and other tracking approaches are provided.

(The dataset and the code will be made public with the publication of the article.)

We measured the performance of our method by using the CLEAR MOT metrics on
the resulting prototypes (with and without occlusion estimation). Additionally, we
compared its performance with six other state-of-the-art methods. CLEAR MOT
consists of multiple metrics, as follows. The Multiple Object Tracking Precision (MOTP)
evaluates the intersection area over the union area of the bounding boxes:

\[ MOTP = \frac{1}{|M_s|} \sum_{k \in M_s} \frac{|t_k \cap gt_k|}{|t_k \cup gt_k|} \]

where $M_s$ is the set of frames in a sequence in which the tracked bounding box $t_k$
intersects the ground-truth bounding box $gt_k$, and $|M_s|$ is the cardinality of $M_s$.
MOTP quantifies how precisely the tracker estimates the location of the object,
regardless of its ability to keep consistent trajectories.
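
The MOTP definition above can be illustrated with a minimal Python sketch. It is not the
evaluation code used for the experiments: the box format (x_min, y_min, x_max, y_max), the
per-frame dictionaries and the function names `iou` and `motp` are assumptions made only
for this illustration.

```python
from typing import Dict, Tuple

# Assumed box format (not specified in the text): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection area over union area of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def motp(tracked: Dict[int, Box], ground_truth: Dict[int, Box]) -> float:
    """Average IoU over the frames where the tracked box intersects the ground truth."""
    overlaps = []
    for k, t_k in tracked.items():
        gt_k = ground_truth.get(k)
        if gt_k is None:
            continue  # no ground-truth box for this frame
        overlap = iou(t_k, gt_k)
        if overlap > 0:  # frame k belongs to M_s only if the boxes intersect
            overlaps.append(overlap)
    return sum(overlaps) / len(overlaps) if overlaps else 0.0


# Toy example: perfect overlap, partial overlap, and no overlap (excluded from M_s).
tracked_boxes = {1: (0, 0, 10, 10), 2: (5, 5, 15, 15), 3: (100, 100, 110, 110)}
gt_boxes = {1: (0, 0, 10, 10), 2: (0, 0, 10, 10), 3: (0, 0, 10, 10)}
print(f"MOTP = {motp(tracked_boxes, gt_boxes):.4f}")
```

In the toy example, frame 3 does not contribute, because its tracked box does not intersect
the ground truth; such misses are penalized by MOTA rather than by MOTP.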

On the other hand, the Multiple Object Tracking Accuracy (MOTA) estimates the
accuracy of the results by penalizing False Negatives (FN), False Positives (FP) and
IDentity Switches (IDS), namely:

\[ MOTA = 1 - \frac{\sum_{k=1}^{l} (FN_k + FP_k + IDS_k)}{\sum_{k=1}^{l} GT_k} \]

where $k$ refers to the frame number, $l$ is the length of the sequence, and $GT_k$ stands
for the number of faces in the ground truth to be tracked at frame $k$. $FN_k$ and $FP_k$
denote the number of false negatives and false positives in frame $k$, respectively.
$IDS_k$ is equal to 1 when the detection does not overlap with its corresponding
ground-truth face target, but with another face.
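
As a minimal sketch of the MOTA definition above, the following Python function computes
the metric from per-frame counts. The function name `mota` and the list-based input format
are assumptions for illustration; how the per-frame counts are obtained from the matches
is not covered here.

```python
from typing import Sequence


def mota(fn: Sequence[int], fp: Sequence[int],
         ids: Sequence[int], gt: Sequence[int]) -> float:
    """MOTA = 1 - sum_k (FN_k + FP_k + IDS_k) / sum_k GT_k over the frames of a sequence."""
    if not (len(fn) == len(fp) == len(ids) == len(gt)):
        raise ValueError("all per-frame lists must have the same length")
    total_gt = sum(gt)
    if total_gt == 0:
        raise ValueError("the sequence contains no ground-truth faces")
    errors = sum(fn) + sum(fp) + sum(ids)
    return 1.0 - errors / total_gt


# Toy sequence of 4 frames with 2 ground-truth faces per frame:
# one miss, one false positive and one identity switch.
example = mota(fn=[0, 1, 0, 0], fp=[0, 0, 1, 0], ids=[0, 0, 0, 1], gt=[2, 2, 2, 2])
print(f"MOTA = {example:.3f}")
```

Note that, by this definition, MOTA can become negative when the number of errors exceeds
the number of ground-truth faces.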

Both metrics intuitively express the overall strength of each tracker and are suitable 
for general performance evaluations. Furthermore, the qualitative comparative results 
are also shown over four different sequences in the next section. 

3.3. Discussion 

Quantitative evaluation: To the best of our knowledge, the only work introduced
exclusively for person tracking in egocentric photo-streams is BoT. Most of the available
tracking techniques are not directly applicable to egocentric photo-streams, since they
rely on assumptions, such as temporal consistency between frames or smooth variation in
target and background appearance, that do not hold for egocentric photo-streams. Still,
we compared our approach to six different state-of-the-art algorithms that are applicable
to egocentric photo-streams, since they rely neither on motion information nor on
background modeling. The selected trackers are designed for tracking one object at a
time, but in our dataset more than one person appears in a sequence. Thus, we applied
the trackers separately for each person to adapt them to our scenario. In this case, the
tracking problem reduces to single-object tracking, and therefore we do not consider the
IDS metric for these methods, as proposed by Smeulders et al. In Table 2 we show the
percentages of MOTP, MOTA, FP, FN and IDS for AMT, BoT, CT, LOT, LIO, and SPT. We also
show how the estimation of occlusions improves the performance of the proposed method in
most of the metrics.

Table 2: Performance comparison

Methods                                     MOTP ↑   MOTA ↑   FP ↓     FN ↓     IDS ↓
AMT (Abrupt Motion Tracking)                60.99%   59.65%   16.70%   23.65%   -
BoT (Bag of Tracklets)                      48.39%   43.44%   22.9%    20.17%   14.30%
CT (Compressive Tracking)                   35.05%   15.32%   33.07%   51.61%   -
LOT (Locally Orderless Tracking)            42.27%   15.57%   33.12%   51.13%   -
LIO (L1 Tracker with Occlusion Detection)   37.25%   25.87%   31.81%   42.32%   -
SPT (SuperPixel Tracking)                   40.75%   39.31%   23.56%   37.13%   -
eBoT (prototype, occlusions not excluded)   68.32%   72.08%   15.19%   10.60%   2.13%
eBoT (prototype, occlusions excluded)       70.27%   80.23%   5.12%    12.51%   2.13%
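
As a quick consistency check on how the columns of Table 2 relate through the MOTA
definition of Section 3.2, the sketch below recomputes MOTA for the two eBoT rows. It
assumes that the FP, FN and IDS percentages are normalized by the total number of
ground-truth faces, which is not stated explicitly in the text; the helper name
`mota_from_rates` is hypothetical.

```python
def mota_from_rates(fp_pct: float, fn_pct: float, ids_pct: float = 0.0) -> float:
    """MOTA in percent, assuming all rates are percentages of the ground-truth count."""
    return 100.0 - (fp_pct + fn_pct + ids_pct)


# (FP, FN, IDS, reported MOTA) copied from the two eBoT rows of Table 2.
ebot_rows = {
    "eBoT (occlusions not excluded)": (15.19, 10.60, 2.13, 72.08),
    "eBoT (occlusions excluded)": (5.12, 12.51, 2.13, 80.23),
}

for name, (fp, fn, ids, reported) in ebot_rows.items():
    derived = mota_from_rates(fp, fn, ids)
    print(f"{name}: derived {derived:.2f}%, reported {reported:.2f}%")
```

Under this assumption, the derived values agree with the reported MOTA of both eBoT rows
to within about 0.01 percentage points.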


As can be observed, the difference among CT, LOT, LIO, and SPT in terms of
precision (MOTP) is small, with CT having the smallest value. A likely reason is that
CT does not change the scale of the bounding box, while the other methods have
a relatively good mechanism of scale adaptation. BoT and AMT have higher precision
than the other methods, with AMT outperforming BoT. This can be explained by the fact
that in AMT the true object is given to the tracker in the initial frame of the
sequence, whereas BoT is fully automatic.

In terms of accuracy (MOTA), CT and LOT perform much the same as each other.
This might be a consequence of the regular appearance model updates in both
trackers, which make them fail when they encounter a large variation between frames.
However, LIO and SPT perform slightly better, since they are able to estimate occlusions,
leading to a lower amount of FPs. SPT and LOT use a superpixel representation, which is
better suited for bigger objects. Thus, they perform better when the face is closer to
the camera and looks bigger. On the other hand, AMT is designed for tracking in low
frame rate videos and performs quite well on our dataset, being able to outperform
BoT. However, it can easily miscalculate the position of the target when there is more
than one face in the frame. The miscalculation may happen due to the use of a color-based
likelihood model, which can easily be misled by a region with colors similar to those of
the target.

As one can see in the lower part of Table 2, the method proposed in this paper
performs much better than the state of the art. The seventh and the eighth rows of the
table show the evaluation metrics obtained before and after estimating occlusions. The
estimation of occlusions reduces FP, while it slightly increases the FN rate due to
wrongly eliminating some true detections in the final prototypes. The proposed method
for prototype extraction drastically reduces FP, FN and IDS, since it optimizes
the localization of the detection.

From this evaluation, we can state that the proposed system can robustly track
the faces of multiple persons under challenging conditions. Moreover, this improvement is
achieved without relying on any strong assumptions and without the need for a
cumbersome training stage.

Qualitative evaluation: The tracking results of the proposed approach, together
with the results of the previously introduced trackers, are shown over four different
sequences in Figs. 8 to 11. Every sequence contains multiple persons, and the
tracking result of each tracker is shown in a specific color in every frame of the
sequences. The result of the proposed approach is shown by a red bounding box around
the face of the person. In frames where our method detects an occlusion, no bounding
box is shown. For the sake of visualization, if a sequence contains more than one
person, the tracking result for each person is shown in a separate row. One of the
figures shows the final prototypes, with estimated occlusions, of prototypes shown in an
earlier figure. Fig. 11 and one further figure show the results for sequences of two
different persons, and another figure shows them for a sequence of three different persons.
Figure 8: Results of applying different methods on an egocentric photo-stream.
Different bounding boxes show the tracking results of the compared trackers, including
CT, LOT, SPT and LIO, and of our proposed approach. Occlusions can be observed in
frame #9 (a) and frames #4 and #9 (b).

Figure 9: Results of applying different methods on an egocentric photo-stream.
Different bounding boxes show the tracking results of the compared trackers, including
CT, LOT, SPT and LIO, and of our proposed approach. Occlusions can be observed in
frame #5 (a) and frame #6 (c).

Figure 10: Results of applying different methods on an egocentric photo-stream.
Different bounding boxes show the tracking results of the compared trackers, including
CT, LOT, SPT and LIO, and of our proposed approach.


Among the state-of-the-art methods, AMT has the best performance on our dataset,
because it was designed to cope with abrupt motion changes. However, since it is not a
multi-tracking method, it can easily produce FPs in the presence of multiple persons. As
can be observed, CT, LOT, LIO, and SPT are unable to find the target when its location
varies largely. In addition, a common drawback of AMT, BoT, CT, and LOT
is that they are unable to localize target occlusions. As expected, the
tracking results of the proposed approach closely match the person's face. However, the
method assigns a wrong region to the track when a person's face is occluded, causing
the occurrence of FPs or IDS. Still, our method is able to precisely estimate occlusions
or wrongly assigned detections.

From our experiments, we observed that the proposed method works better
when people are closer to the camera. As the distance of the people from the camera
increases, the resolution of the image in their face region decreases. This
leads to the generation of fewer seeds by the face detector and to unreliable matches by
the deep matching approach. The illumination condition is another important factor.
eBoT is quite robust to illumination changes, although it performs better when
the images are not too dark.
Figure 11: Results of applying different methods on an egocentric photo-stream.
Different bounding boxes show the tracking results of the compared trackers, including
CT, LOT, SPT and LIO, and of our proposed approach. Occlusions can be observed in
frame #3 (a) and frame #10 (b).


3.4. Complexity analysis 

Regarding the complexity of our algorithm, the most expensive part is the
construction of the tracklets, where deep matching is applied with a sliding window
procedure to all windows having a color similar to that of the seed in the HSV color
space. The most expensive part of the deep matching algorithm lies in the computation
of the first level convolutions. However, the computational burden could be mitigated by
using a GPU or a faster matching algorithm that achieves similar performance. Finding
the optimal matching score among all feasible non-rigid warpings for all square patches
at different scales, from the first image at all locations in the second image, can be
done with complexity $O(PP')$, where $P$ and $P'$ are the numbers of pixels of the two
images. Usually, the size of the seed image is between 5000 and 6000 pixels and the
number of samples to be considered is about 2000. On an Intel i5 CPU at 2.53 GHz,
running 64-bit Windows 7 with 4 GB of RAM, it takes on average about 1 minute per pair
of images to find the candidate most similar to the seed. The complexity of the
remaining algorithms, which construct the eBoT and extract the prototype, is
$O(M \cdot N^2)$, where $M$ is the number of faces appearing in the sequence and $N$ is
the length of the sequence; they take less than a minute on the aforementioned computer.
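
To give a feeling for these figures, the following back-of-the-envelope Python sketch
estimates the time needed to build the tracklets of one sequence. It assumes, as a
simplification not stated in the text, that every seed is matched once against each of
the other frames of its sequence, at the reported average of about 1 minute per pair of
images; the function name and its parameters are hypothetical.

```python
def tracklet_construction_minutes(num_seeds: int, sequence_length: int,
                                  minutes_per_pair: float = 1.0) -> float:
    """Rough tracklet-construction time for one sequence, in minutes.

    Assumption: each seed is matched against every other frame of the sequence,
    and each (seed, frame) pair costs `minutes_per_pair` on the reported CPU setup.
    """
    if num_seeds < 0 or sequence_length < 1:
        raise ValueError("num_seeds must be >= 0 and sequence_length >= 1")
    pairs = num_seeds * (sequence_length - 1)
    return pairs * minutes_per_pair


# One seed in a sequence of 25 frames (the average sequence length of the dataset).
print(tracklet_construction_minutes(num_seeds=1, sequence_length=25))
```

Under these assumptions, the matching cost grows linearly with both the number of seeds
and the sequence length, which is consistent with tracklet construction dominating the
running time.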
4. Conclusions


In this work, we proposed a novel method to track multiple faces in low temporal
resolution sequences acquired by wearable cameras, which is of high interest for analyzing
social events and social interactions in egocentric vision. Relying on the extended
bag-of-tracklets approach for tracking a person increases the robustness and efficiency of
our method. To deal with various types of object-induced or camera-induced image
deformations, tracklets are computed by using the average deep-matching score between
the seed and each sample in different frames. Moreover, in order to extract the final
prototype, eBoT introduces a useful measure of confidence to estimate and discard
occlusions and missed detections.

A quantitative comparison on a dataset of 20,000 images between our model
and six other state-of-the-art methods showed its advantage under drastic changes of
pose, scale and object appearance. Future work will be devoted to quantifying the kind
of interaction with the camera wearer, as well as to detecting and classifying social
events. Human memories are influenced by emotions, and the strong emotional impact of
social interaction is well acknowledged. Thus, a direct application will be to use the
extracted prototypes for cognitive training of patients with mild cognitive impairment.